#5 Basic NLP with Command-line
Faculty of Humanities and Social Sciences
University of Lucerne
23 March 2024
Figure: Historical development of Swiss party politics (source: Tagesanzeiger)
Plain text files (.txt)
Structured text files (.csv, .tsv, .xml)
Word frequencies, saved in a .tsv file:
absolute frequency = n_occurrences
relative frequency = n_occurrences / n_total_words
Print the following sentence in your command line using echo.
How many words are in this sentence? Use the pipe operator | to pass the output above to the command wc.
Match the word computational and colorize its occurrences in the sentence using egrep (equivalent to grep -E).
Get the frequencies of each word in this sentence using tr and other commands.
Save the frequencies into a .tsv file, open it in a spreadsheet program (e.g., Excel, Numbers), and compute the relative frequency per word.
Are there some words that, although different, should be considered as the same?
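One possible way to approach the tasks above is sketched below. The sentence is a hypothetical stand-in (not the one from the exercise), and the output file name frequencies.tsv is an assumption made for illustration.

```shell
# Hypothetical stand-in sentence (NOT the sentence from the exercise).
sentence="Computational methods help; computational tools scale."

# Count the words by piping the output of echo to wc.
echo "$sentence" | wc -w
# -> 6

# Highlight every occurrence of "computational", ignoring case (-i).
# grep -E is the current spelling of egrep.
echo "$sentence" | grep -E -i --color=auto "computational"

# Word frequencies: lowercase everything, turn each run of non-letters
# into a single newline (one word per line), then sort and count.
# awk swaps the columns so that each row reads: word <TAB> count.
echo "$sentence" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -sc '[:alpha:]' '\n' \
  | sort | uniq -c | sort -rn \
  | awk '{ print $2 "\t" $1 }' > frequencies.tsv   # assumed file name

cat frequencies.tsv

# Optional cross-check of the spreadsheet step:
# relative frequency = n_occurrences / n_total_words
awk -F'\t' '{ n[$1] = $2; total += $2 }
  END { for (w in n) printf "%s\t%d\t%.3f\n", w, n[w], n[w] / total }' frequencies.tsv
```

Lowercasing before counting is one answer to the last question: without it, Computational and computational would be counted as two different words.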
So far, we have searched for exact matches only. But …
How can we find all words starting with the letter A?
… specific parts in texts
\w represents alphanumeric characters
Quantifiers: match the preceding element X times
? zero or one
+ one or more
* zero or any number
{n}, {m,n} a specified number of times
⚠️ Do not confuse regex with Bash wildcards!
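The quantifiers can be tried out on a small word list, as in the following sketch. The words are hypothetical sample data, and \w is used as supported by GNU grep.

```shell
# Hypothetical sample data, one word per line.
words=$(printf '%s\n' "A" "Aare" "Alpen" "Bern" "color" "colour" "coloour")

# ? : zero or one "u"
echo "$words" | grep -E '^colou?r$'
# -> color, colour

# + : one or more "o" before "ur"
echo "$words" | grep -E '^colo+ur$'
# -> colour, coloour

# * : zero or any number of word characters after "A",
#     i.e. all words starting with the letter A
echo "$words" | grep -E '^A\w*$'
# -> A, Aare, Alpen

# {n} : exactly two "o" in a row
echo "$words" | grep -E 'o{2}'
# -> coloour

# Regex vs. Bash wildcard: in a regex, * repeats the PRECEDING element,
# so 'A*' also matches zero "A"s and therefore any line at all ...
echo "Bern" | grep -E 'A*'
# -> Bern

# ... whereas in a Bash wildcard pattern, * alone stands for any string.
case "Aare" in
  A*) echo "wildcard A* matches Aare" ;;
esac
```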
[...] matches any of the characters between brackets
e.g., [auoei], [0-9], [A-Z], [a-z]
. matches any character (excl. newline)
\ escapes the following character to match it literally
\. means the literal . instead of “any symbol”
\w matches any alphanumeric character
equivalent to [A-Za-z0-9_]
\s matches any whitespace (space, newline, tab)
equivalent to [ \t\n]
.* 💪 matches any character any number of times
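A short sketch of these patterns in action follows. The sample line is hypothetical, and the -o option (print only the matching part, one match per line) is used to make the effect of each pattern visible.

```shell
# Hypothetical sample line.
line="Die SP erhielt 18.3 Prozent im Jahr 2023."

# [0-9]+ : runs of digits
echo "$line" | grep -E -o '[0-9]+'
# -> 18, 3, 2023

# \. : a literal dot, so only the decimal number matches
echo "$line" | grep -E -o '[0-9]+\.[0-9]+'
# -> 18.3

# Unescaped . : any character, so 2023 matches as well
echo "$line" | grep -E -o '[0-9]+.[0-9]+'
# -> 18.3, 2023

# [A-Z][a-z]+ : an uppercase letter followed by lowercase letters.
# LC_ALL=C forces plain ASCII ranges; omit it for texts with umlauts.
echo "$line" | LC_ALL=C grep -E -o '[A-Z][a-z]+'
# -> Die, Prozent, Jahr

# \s : whitespace between two words (GNU grep)
echo "$line" | grep -E -o 'im\sJahr'
# -> im Jahr

# .* : anything between two anchor words
echo "$line" | grep -E -o 'SP.*Prozent'
# -> SP erhielt 18.3 Prozent
```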
Update the course repository using git pull. If you haven’t cloned the repository yet, follow section 5 of the installation guide.
The party programmes are located in KED2024/materials/data/swiss_party_programmes/txt. Change into that directory using cd.
Have a look at some of the files using more.
Compare the absolute frequencies of single terms or multi-word phrases of your choice (e.g., Ökologie, Sicherheit, Schweiz)…
Use the file names as a filter to aggregate the word counts in various ways.
Pick terms of your interest and look at their contextual use by extracting relevant passages. Does the usage differ across parties or time?
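The following sketch illustrates these steps on a tiny stand-in corpus. Everything about the files is an assumption: the texts are invented, and the naming scheme party_year.txt is hypothetical, so the wildcard patterns need to be adapted to the actual file names in the course repository.

```shell
# Hypothetical mini corpus in a temporary directory.
# Assumed naming scheme: party_year.txt
corpus=$(mktemp -d)
printf 'Sicherheit für die Schweiz.\nMehr Sicherheit im Alter.\n' > "$corpus/svp_2019.txt"
printf 'Ökologie und Sicherheit.\n' > "$corpus/sp_2019.txt"
printf 'Ökologie zuerst.\nÖkologie für die Schweiz.\n' > "$corpus/sp_2023.txt"
cd "$corpus"

# Absolute frequency of a term per file: print every match on its own
# line (-o) with the file name in front (-H), keep only the file name,
# then count identical names. Files without a match are not listed.
grep -o -H 'Sicherheit' *.txt | cut -d: -f1 | sort | uniq -c

# Aggregate by filtering on file names: all files of one party ...
cat sp_*.txt | grep -o 'Ökologie' | wc -l
# -> 3

# ... or all files of one year.
cat *_2019.txt | grep -o 'Sicherheit' | wc -l
# -> 3

# Multi-word phrases work the same way.
cat *.txt | grep -o 'für die Schweiz' | wc -l
# -> 2

# Contextual use: show one line before and after each match (-C 1),
# ignoring case (-i).
grep -i -C 1 'schweiz' *.txt
```

Counting with grep -o piped into wc -l counts every occurrence, whereas grep -c would count matching lines only, which undercounts when a term appears twice in one line.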
If you are looking for useful primers on Bash, consider the following resources: